Webscraping in R

Workshop FF UK 5.10.2023 2️⃣
Selected chapters in data analysis

Renata Topinkova

LMU Munich
📫 renata.topinkova[at]lmu.de

What does API stand for?

API = Application Programming Interface

  • designated points of data access

How does it work?

Source: https://www.geeksforgeeks.org/what-is-an-api/

Starting with APIs


Isn’t there an R package for that?

📦 WHO, guardianapi, spotifyR, nytimes, wbstats, RedditExtractoR


Are you sure?

Google, GitHub


If you’re SURE sure… Generic package

📦 httr, httr2

Extra danger zone

📖 STEP 1 : Read the documentation

  • Endpoint = designated point for data collection (often > 1)

  • Parameters = How can I narrow down what I want to get? What can I get? What values does the API accept?

  • Authentication = Do I need API token? How do I get it? Where do I put it?

  • Rate limits = How much can I download in a minute/day?

  • ToS = What are you allowed to do with the data? Can you publish it? In what form?
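
As a rough sketch, the items you look up in the documentation map onto httr2 calls roughly as shown here. The base URL, the path `v1/names`, the parameter names and the limit of 30 requests per minute are all invented for illustration; a real API's documentation determines each of them. Authentication is left out here because it differs so much between APIs.

```r
library(httr2)

# Hypothetical API: URL, parameters and limits are placeholders
req <- request("https://api.example.com/v1/names") |>  # endpoint
  # parameters: key = value pairs the documentation says are accepted
  req_url_query(name = "Renata", country = "CZ") |>
  # rate limit: assume the docs allow 30 requests per minute
  req_throttle(rate = 30 / 60)

# Nothing has been sent yet; this only shows the URL that would be requested
req$url
```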

Now, how does it look in R?

Code
library(httr2)

resp <- request("https://api.nationalize.io") |>
  # specify the parameters for our query - key = value pairs
  req_url_query(name = "Renata") |>
  # until we do req_perform, nothing is sent to the API
  req_perform()

resp

HTTP statuses

Image captions:

  • The good boi
  • Likely API token issue
  • Likely bad query
  • Exceeding rate limit
  • Server-side issue
  • Just cute

Starts with…

  • 2 - good, success
  • 4 - your fault (user-side error)
  • 5 - their fault (server-side error)

If unsure, check doggos
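
The first-digit rule can be written down as a tiny helper. This is only an illustrative sketch; `classify_status()` is a made-up name, not part of httr2. With a real response you could pass it the value of `resp_status(resp)`.

```r
# Map an HTTP status code to the broad classes listed above
classify_status <- function(code) {
  first_digit <- code %/% 100
  if (first_digit == 2) {
    "success"
  } else if (first_digit == 4) {
    "client error (your side)"
  } else if (first_digit == 5) {
    "server error (their side)"
  } else {
    "other (e.g. informational or redirect)"
  }
}

classify_status(200)
classify_status(429)
classify_status(503)
```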

Code
library(httr2)

resp <- request("https://api.nationalize.io") |>
  # specify the parameters for our query - key = value pairs
  req_url_query(name = "Renata") |>
  # until we do req_perform, nothing is sent to the API
  req_perform()

resp_headers(resp)

httr2 workflow

1. Specify the endpoint for your query

Code
library(httr2)
req <- request("endpoint")


2. Specify the query itself

  • req_url_query() or req_url_path()
Code
req |>
  # key and values depend on the API documentation
  req_url_query(key = "value")

3. Authenticate if the API requires it

  • There is no one-size-fits-all solution - it depends on the API, so read through the documentation & hope for the best

  • Different functions available:

    • req_auth_bearer_token()
    • req_oauth_* functions for OAuth
    • req_headers()
Code
req |>
  # key and values depend on the API documentation
  req_url_query(key = "value")
  ## authenticate if needed here
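
A minimal sketch of three common ways an API may ask for a token. Which one applies (if any) is something only the API's documentation can tell you. The endpoint, the header name `X-Api-Key`, the parameter name `api_key` and the environment variable `MY_API_TOKEN` are all hypothetical.

```r
library(httr2)

# Read the token from an environment variable so it never appears in the
# script itself; "dummy-token" is only a fallback for this demonstration
token <- Sys.getenv("MY_API_TOKEN", unset = "dummy-token")

# Hypothetical endpoint and query
req <- request("https://api.example.com/v1/data") |>
  req_url_query(key = "value")

# Variant A: the API expects an "Authorization: Bearer <token>" header
req_bearer <- req |> req_auth_bearer_token(token)

# Variant B: the API expects the key in a custom header (name is made up)
req_header <- req |> req_headers("X-Api-Key" = token)

# Variant C: the API expects the key as a query parameter (name is made up)
req_query <- req |> req_url_query(api_key = token)
```

None of these lines sends anything; each one only adds the credential to the request object.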

4. OPTIONAL: Test it out, see what you are planning to send

Code
req |>
  # key and values depend on the API documentation
  req_url_query(key = "value") |>
  req_dry_run()
  • This can be especially useful if you are not sure whether you have constructed the query correctly
  • Nothing is sent to the API - it is just a preview for you

5. Send the request - req_perform()

Code
resp <- req |>
  # key and values depend on the API documentation
  req_url_query(key = "value") |>
  req_perform()
  • Until req_perform() is called, nothing gets sent to the API!
  • Make sure you assign the response to an object, so you don’t have to call the API multiple times with the same query

6. Parse the response

  • Different functions based on the type of response you get resp_body_* (json, xml, html)
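
One way to choose between the resp_body_* functions is to look at the Content-Type header of the response. The helper `parse_body()` below is a made-up name and only a sketch. To keep the example offline, it is tried on a hand-made response built with httr2's `response()` constructor rather than on a real API call.

```r
library(httr2)

# Pick a parser based on the Content-Type header of the response
parse_body <- function(resp) {
  type <- resp_content_type(resp)
  if (is.na(type)) type <- ""
  if (grepl("json", type, fixed = TRUE)) {
    resp_body_json(resp)
  } else if (grepl("xml", type, fixed = TRUE)) {
    resp_body_xml(resp)   # needs the xml2 package
  } else if (grepl("html", type, fixed = TRUE)) {
    resp_body_html(resp)  # needs the xml2 package
  } else {
    resp_body_string(resp)
  }
}

# Offline demo: a hand-made JSON response with invented content
fake <- response(
  headers = list("Content-Type" = "application/json"),
  body = charToRaw('{"name": "Renata", "count": 3}')
)

parse_body(fake)
```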

Leaving httr2

7. Wrangle the data - as_tibble, map_*, bind_rows, and others

  • or write your own function, if all else fails

8. Unnest if needed - unnest, unnest_wider, unnest_longer

9. Analysis!
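
A small sketch of steps 7 and 8 for several responses at once. The two lists below are hand-made stand-ins for parsed JSON bodies (names and probabilities are invented), and `body_to_tibble()` is a hypothetical helper; real responses may need different wrangling.

```r
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

# Two hand-made lists shaped like parsed JSON bodies (values are invented)
bodies <- list(
  list(name = "Anna", country = list(
    list(country_id = "CZ", probability = 0.2),
    list(country_id = "PL", probability = 0.1)
  )),
  list(name = "Jan", country = list(
    list(country_id = "NL", probability = 0.3)
  ))
)

# One body -> one tibble with a row per country
body_to_tibble <- function(body) {
  tibble(name = body$name, country = body$country) |>
    unnest_wider(country)
}

# map() applies the function to every body, bind_rows() stacks the results
result <- bodies |>
  map(body_to_tibble) |>
  bind_rows()

result
```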

Constructing the query

  1. Specify the endpoint for your query
Code
library(httr2)
endpoint <- request("https://api.nationalize.io")
  2. Specify the query itself

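req_url_path() sets the path after the host (everything after the first /); req_url_path_append() adds to an existing path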

req_url_query() adds ? after endpoint, key-value pairs are separated by &

Code
endpoint |> 
  req_url_query(name = "Renata")
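
To see what the URL helpers do without contacting any server, you can inspect the `url` field of the request object. The base URL, the path `names` and the parameter `limit` in this sketch are hypothetical.

```r
library(httr2)

# Hypothetical base URL that already contains a path (/v1)
base <- request("https://api.example.com/v1")

# req_url_path() replaces the existing path
replaced <- base |> req_url_path("names")

# req_url_path_append() keeps the existing path and adds to it
appended <- base |> req_url_path_append("names")

# req_url_query() adds ?key=value pairs, separated by &
with_query <- appended |> req_url_query(name = "Renata", limit = 5)

replaced$url
appended$url
with_query$url
```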

Code
library(httr2)
endpoint <- request("https://api.nationalize.io")

endpoint |>
  req_url_query(name = "Renata") |>
  req_dry_run()
GET / HTTP/1.1
Host: api.nationalize.io?name=Renata
User-Agent: httr2/0.2.3 r-curl/5.0.1 libcurl/7.84.0
Accept: */*
Accept-Encoding: deflate, gzip


If we wanted to change or add a header, we could do it with req_headers()

Code
endpoint |>  
  req_url_query(name = "Renata") |> 
  req_headers("User-Agent" = "Renata Topinkova | renata.topinkova@lmu.de") |> 
  req_dry_run()
GET / HTTP/1.1
Host: api.nationalize.io?name=Renata
Accept: */*
Accept-Encoding: deflate, gzip
User-Agent: Renata Topinkova | renata.topinkova@lmu.de

Make sure you assign it to a new object!

Code
resp <- endpoint |> 
  req_url_query(name = "Renata") |> 
  req_perform()


Explore what you got

Code
resp

Parsing responses

  1. Parse the response
  • Different functions based on the type of response you get resp_body_* (json, xml, html)
Code
resp |> 
  resp_body_json()
$count
[1] 172839

$name
[1] "Renata"

$country
$country[[1]]
$country[[1]]$country_id
[1] "CZ"

$country[[1]]$probability
[1] 0.168


$country[[2]]
$country[[2]]$country_id
[1] "BR"

$country[[2]]$probability
[1] 0.144


$country[[3]]
$country[[3]]$country_id
[1] "PL"

$country[[3]]$probability
[1] 0.132


$country[[4]]
$country[[4]]$country_id
[1] "LT"

$country[[4]]$probability
[1] 0.084


$country[[5]]
$country[[5]]$country_id
[1] "SK"

$country[[5]]$probability
[1] 0.076

Parsing responses II

Often useful to examine the structure - can help us figure out why wrangling is failing

Code
resp  |>  
  resp_body_json() |> 
  str()
List of 3
 $ count  : int 172839
 $ name   : chr "Renata"
 $ country:List of 5
  ..$ :List of 2
  .. ..$ country_id : chr "CZ"
  .. ..$ probability: num 0.168
  ..$ :List of 2
  .. ..$ country_id : chr "BR"
  .. ..$ probability: num 0.144
  ..$ :List of 2
  .. ..$ country_id : chr "PL"
  .. ..$ probability: num 0.132
  ..$ :List of 2
  .. ..$ country_id : chr "LT"
  .. ..$ probability: num 0.084
  ..$ :List of 2
  .. ..$ country_id : chr "SK"
  .. ..$ probability: num 0.076

Wrangling the data

  • Danger, one size may not fit all
Code
library(tidyverse)

resp |>  
  resp_body_json() |>  
  as_tibble()
# A tibble: 5 × 3
   count name   country         
   <int> <chr>  <list>          
1 172839 Renata <named list [2]>
2 172839 Renata <named list [2]>
3 172839 Renata <named list [2]>
4 172839 Renata <named list [2]>
5 172839 Renata <named list [2]>

Unnest if needed

unnest, unnest_wider, unnest_longer

Code
resp |>
  resp_body_json() |>
  as_tibble() |>
  unnest_wider(country)
# A tibble: 5 × 4
   count name   country_id probability
   <int> <chr>  <chr>            <dbl>
1 172839 Renata CZ               0.168
2 172839 Renata BR               0.144
3 172839 Renata PL               0.132
4 172839 Renata LT               0.084
5 172839 Renata SK               0.076

Additional features

There are many other useful functions in httr2; see the package documentation


All functions requesting something start with req_*, all functions working with the response start with resp_*

  • req_throttle()
  • req_retry()
  • resp_is_error()

… etc.
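
A hedged sketch of how these three functions might be combined. The endpoint and the limit of 10 requests per minute are invented, and `fetch_or_null()` is a hypothetical helper. The request is only built here; nothing is sent until `fetch_or_null(req)` is called.

```r
library(httr2)

# Hypothetical endpoint; the limits below are illustrative only
req <- request("https://api.example.com/v1/data") |>
  # wait between requests so that at most 10 are sent per minute
  req_throttle(rate = 10 / 60) |>
  # retry transient failures (by default 429 and 503) up to 3 times
  req_retry(max_tries = 3) |>
  # do not turn HTTP errors into R errors; we check them ourselves below
  req_error(is_error = function(resp) FALSE)

# Perform the request and return NULL (with a message) on an HTTP error
fetch_or_null <- function(req) {
  resp <- req_perform(req)
  if (resp_is_error(resp)) {
    message("Request failed with status ", resp_status(resp))
    return(NULL)
  }
  resp_body_json(resp)
}
```

Returning NULL instead of stopping can be handy inside a loop over many queries, where one failed request should not abort the whole download.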

Genderize & nationalize APIs

Nationalize API

an API for predicting nationality from a name


Genderize API

a simple API to predict the gender of a person given their name

May seem silly but…

e.g., Holman et al. (2018) - estimating gender gap in science

Let’s try it out

Open the 02_1_API_wo_package_exercise.qmd file.

Note

Make sure to make a project where your work will reside.


OMDb

OMDb API

The OMDb API is a RESTful web service to obtain movie information, all content and images on the site are contributed and maintained by our users.

Let’s try it out II

Open the 02_2_API_wo_package_exercise.qmd file.

Note

Make sure you place your API key inside your project.
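
One possible way to keep the key inside the project without typing it into your script, shown as a sketch. It assumes the key is stored in a project-level .Renviron file as `OMDB_API_KEY=yourkey` (the variable name is arbitrary) and that this file is not shared publicly. `omdb_request()` is a hypothetical helper; `apikey` and `t` (movie title) are the parameter names as OMDb's documentation describes them, so check the documentation for the full list.

```r
library(httr2)

# Build an OMDb request; the key defaults to the OMDB_API_KEY
# environment variable (assumed to be set, e.g. via .Renviron)
omdb_request <- function(title, key = Sys.getenv("OMDB_API_KEY")) {
  if (!nzchar(key)) {
    stop("No API key found; set OMDB_API_KEY first.")
  }
  request("https://www.omdbapi.com/") |>
    req_url_query(apikey = key, t = title)
}

# Build (but do not send) a request with a placeholder key
req <- omdb_request("Inception", key = "placeholder")
req$url
```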
